Airline Data Challenge¶

Problem Statement:¶

You are consulting for an airline company looking to enter the United States domestic market which has identified medium and large airports as their desired operating locations. The company believes that it has a competitive advantage in maintaining punctuality, so it plans on making this a big part of its brand image with a motto, “On time, for you.” To kick start operations, the company has decided to start with 5 round trip routes. An example of a round trip route is the combination of JFK to ORD and ORD to JFK. The opposite order of the route, ORD to JFK and JFK to ORD, would be considered the same round trip.

Import Necessary Libraries¶

In [1]:
#importing libraries
import pandas as pd
import numpy as np
import plotly.express as px

Quality Check Data¶

Data Quality Observations:¶

  • In the Flights.csv dataset, columns 3,13 and 14 have mixed datatypes.
  • Column 3, OP_CARRIER_FL_NUM, should be a string so we will use the str datatype here.
  • Column 13, AIR_TIME in minutes, should be a datatype of int, so we will use the str datatype and investigate further in next steps.
  • Column 14, DISTANCE in miles, should be a datatype of int, so we will use the str datatype and investigate further in next steps.

Reading Datasets¶

In [2]:
#read airport codes dataset
airport_codes = pd.read_csv('Airport_Codes.csv')

#read flights dataset
flights = pd.read_csv('Flights.csv', dtype = {'OP_CARRIER_FL_NUM':str,'AIR_TIME':str,'DISTANCE':str})

#read tickets dataset
tickets = pd.read_csv('Tickets.csv')

Investigating Mixed Datatypes¶

We will first check to see what the impact would be, to both the AIR_TIME and DISTANCE columns, of losing values that are not of the numeric datatype. If the impact is small, we can assume this will not have a significant impact on our overall analysis.

In [3]:
#finding percent of nulls in the air_time column
air_time_null_percentage = (flights['AIR_TIME'].isnull().sum() / len(flights['AIR_TIME'])) * 100

#printing the output
print(f"Percentage of null values in 'AIR_TIME' column is {air_time_null_percentage:.2f}%")
Percentage of null values in 'AIR_TIME' column is 2.95%
In [4]:
#creating a test air_time column before converting values to the numeric datatype
flights['AIR_TIME_TEST'] = flights['AIR_TIME']

#converting test air_time column to numeric datatype and coercing non-convertible values
flights['AIR_TIME_TEST'] = pd.to_numeric(flights['AIR_TIME_TEST'], errors='coerce')

#finding percent of nulls in the air_time test column post-conversion
air_time_test_null_percentage = (flights['AIR_TIME_TEST'].isnull().sum() / len(flights['AIR_TIME_TEST'])) * 100

#printing the output
print(f"Percentage of null values in 'AIR_TIME_TEST' column is {air_time_test_null_percentage:.2f}%")
Percentage of null values in 'AIR_TIME_TEST' column is 3.05%

Data Quality Insight: The percentage of null values in the AIR_TIME_TEST column increases by 0.1%. This is a small enough increase that it will not significantlty impact our analysis, so we can proceed with converting the values within our original AIR_TIME column and coercing non-convertible values.

In [5]:
#dropping air_time test column
flights = flights.drop(columns=['AIR_TIME_TEST'])

#converting air_time column to numeric datatype and coercing non-convertible values
flights['AIR_TIME'] = pd.to_numeric(flights['AIR_TIME'], errors='coerce')

We can now do the same thing for the DISTANCE column.

In [6]:
#finding percent of nulls in the distance column
distance_null_percentage = (flights['DISTANCE'].isnull().sum() / len(flights['DISTANCE'])) * 100

#printing the output
print(f"Percentage of null values in 'DISTANCE' column is {distance_null_percentage:.2f}%")
Percentage of null values in 'DISTANCE' column is 0.03%
In [7]:
#creating a test distance column before converting values to the numeric datatype
flights['DISTANCE_TEST'] = flights['DISTANCE']

#converting test distance column to numeric datatype and coercing non-convertible values
flights['DISTANCE_TEST'] = pd.to_numeric(flights['DISTANCE_TEST'], errors='coerce')

#finding percent of nulls in the distance test column post-conversion
distance_test_null_percentage = (flights['DISTANCE_TEST'].isnull().sum() / len(flights['DISTANCE_TEST'])) * 100

#printing the output
print(f"Percentage of null values in 'DISTANCE_TEST' column is {distance_test_null_percentage:.2f}%")
Percentage of null values in 'DISTANCE_TEST' column is 0.14%

Data Quality Insight: The percentage of null values in the DISTANCE_TEST column increases by 0.1%. This is a small enough increase that it will not significantlty impact our analysis, so we can proceed with converting the values within our original DISTANCE column and coercing non-convertible values.

In [8]:
#dropping distance test column
flights = flights.drop(columns=['DISTANCE_TEST'])

#converting distance column to numeric datatype and coercing non-convertible values
flights['DISTANCE'] = pd.to_numeric(flights['DISTANCE'], errors='coerce')

Printing Dataset Information & Checking Column Values¶

In [9]:
#printing airport_codes dataset info
airport_codes.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55369 entries, 0 to 55368
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   TYPE          55369 non-null  object 
 1   NAME          55369 non-null  object 
 2   ELEVATION_FT  48354 non-null  float64
 3   CONTINENT     27526 non-null  object 
 4   ISO_COUNTRY   55122 non-null  object 
 5   MUNICIPALITY  49663 non-null  object 
 6   IATA_CODE     9182 non-null   object 
 7   COORDINATES   55369 non-null  object 
dtypes: float64(1), object(7)
memory usage: 3.4+ MB
In [10]:
#finding percent of null iata_codes in airport dataset
iata_code_null_percentage = (airport_codes['IATA_CODE'].isnull().sum() / len(airport_codes['IATA_CODE'])) * 100

#printing the output
print(f"Percentage of null values in 'IATA_CODE' column is {iata_code_null_percentage:.2f}%")
Percentage of null values in 'IATA_CODE' column is 83.42%
In [11]:
#filtering the DataFrame to get rows where 'IATA_CODE' is null
us_airports = airport_codes[(airport_codes['ISO_COUNTRY'] == 'US')]

#getting the value counts of 'TYPE' in 'missing_iata_code'
type_counts_us_aiports = us_airports['TYPE'].value_counts()

#calculating the percentage of each value relative to the total count
total_us_aiports = type_counts_us_aiports.sum()
percentages = (type_counts_us_aiports / total_us_aiports) * 100

#combining the value counts and percentages into one dataframe
result_df = pd.DataFrame({'Value Counts': type_counts_us_aiports, 'Percentage of Total':percentages.apply(lambda x: f'{x:.2f}%')})

result_df
Out[11]:
Value Counts Percentage of Total
TYPE
small_airport 13708 60.10%
heliport 6268 27.48%
closed 1392 6.10%
medium_airport 687 3.01%
seaplane_base 566 2.48%
large_airport 171 0.75%
balloonport 18 0.08%
In [12]:
#checking for duplicate IATA_CODES 
duplicates = airport_codes.duplicated(subset=['IATA_CODE'], keep=False)

# Get the rows with duplicate IATA_CODES
duplicate_rows = airport_codes[duplicates & ~airport_codes['IATA_CODE'].isnull()]

# Counting the distinct 'TYPE' values and their occurrences for each duplicate 'IATA_CODE'
duplicate_counts = duplicate_rows.groupby(['IATA_CODE', 'TYPE']).size().reset_index()
duplicate_counts.columns = ['IATA_CODE', 'TYPE', 'Count']

duplicate_counts
Out[12]:
IATA_CODE TYPE Count
0 0 small_airport 80
1 AHT closed 2
2 ARX closed 1
3 ARX medium_airport 1
4 AUS closed 1
5 AUS large_airport 1
6 CDT medium_airport 2
7 CLG closed 1
8 CLG small_airport 1
9 DDU small_airport 2
10 DLR small_airport 2
11 DZI small_airport 2
12 ESP closed 1
13 ESP small_airport 1
14 GGC small_airport 2
15 HKG closed 1
16 HKG large_airport 1
17 IZA medium_airport 2
18 JNB large_airport 3
19 KCZ medium_airport 2
20 KMM small_airport 2
21 KWB small_airport 2
22 LHR closed 1
23 LHR large_airport 1
24 LMC small_airport 2
25 LPE small_airport 2
26 MNI closed 1
27 MNI medium_airport 1
28 MPT small_airport 2
29 MRE medium_airport 1
30 MRE small_airport 1
31 MUC closed 1
32 MUC large_airport 1
33 MXR small_airport 2
34 NWT small_airport 2
35 PCO small_airport 2
36 PHM closed 3
37 PRM heliport 1
38 PRM medium_airport 1
39 RCH medium_airport 1
40 RCH small_airport 1
41 RMD small_airport 2
42 RTI small_airport 2
43 RZS small_airport 2
44 SGL heliport 1
45 SGL small_airport 1
46 SHO closed 1
47 SHO medium_airport 1
48 SVD closed 1
49 SVD medium_airport 1
50 ULG small_airport 2
51 VQS closed 1
52 VQS small_airport 1
53 YTY medium_airport 2
54 ZRZ small_airport 2

Data Quality Insight: airports_codes contains duplicate IATA_CODES, so we will need to deduplicate thia dataset, while prioritizing rows where TYPE = medium_airport. Since we are only concerned with medium & large airports, we will create a new dataframe where TYPE is one of these two.

In [13]:
#filtering for rows where 'TYPE' is 'medium_airport' or 'large_airport'
filtered_airport_codes = airport_codes[airport_codes['TYPE'].isin(['medium_airport', 'large_airport'])]

filtered_airport_codes.head()
Out[13]:
TYPE NAME ELEVATION_FT CONTINENT ISO_COUNTRY MUNICIPALITY IATA_CODE COORDINATES
6194 medium_airport Aleknagik / New Airport 66.0 NaN US Aleknagik WKK -158.617996216, 59.2826004028
10444 medium_airport Honiara International Airport 28.0 OC SB Honiara HIR 160.05499267578, -9.4280004501343
10449 medium_airport Munda Airport 10.0 OC SB NaN MUA 157.26300048828125, -8.327969551086426
10471 medium_airport Hongyuan Airport 11600.0 AS CN Aba AHJ 102.35224, 32.53154
10695 medium_airport Nauru International Airport 22.0 OC NR Yaren District INU 166.919006, -0.547458
In [14]:
#printing flights dataset info
flights.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1915886 entries, 0 to 1915885
Data columns (total 16 columns):
 #   Column             Dtype  
---  ------             -----  
 0   FL_DATE            object 
 1   OP_CARRIER         object 
 2   TAIL_NUM           object 
 3   OP_CARRIER_FL_NUM  object 
 4   ORIGIN_AIRPORT_ID  int64  
 5   ORIGIN             object 
 6   ORIGIN_CITY_NAME   object 
 7   DEST_AIRPORT_ID    int64  
 8   DESTINATION        object 
 9   DEST_CITY_NAME     object 
 10  DEP_DELAY          float64
 11  ARR_DELAY          float64
 12  CANCELLED          float64
 13  AIR_TIME           float64
 14  DISTANCE           float64
 15  OCCUPANCY_RATE     float64
dtypes: float64(6), int64(2), object(8)
memory usage: 233.9+ MB
In [15]:
flights.head()
Out[15]:
FL_DATE OP_CARRIER TAIL_NUM OP_CARRIER_FL_NUM ORIGIN_AIRPORT_ID ORIGIN ORIGIN_CITY_NAME DEST_AIRPORT_ID DESTINATION DEST_CITY_NAME DEP_DELAY ARR_DELAY CANCELLED AIR_TIME DISTANCE OCCUPANCY_RATE
0 2019-03-02 WN N955WN 4591 14635 RSW Fort Myers, FL 11042 CLE Cleveland, OH -8.0 -6.0 0.0 143.0 1025.0 0.97
1 2019-03-02 WN N8686A 3231 14635 RSW Fort Myers, FL 11066 CMH Columbus, OH 1.0 5.0 0.0 135.0 930.0 0.55
2 2019-03-02 WN N201LV 3383 14635 RSW Fort Myers, FL 11066 CMH Columbus, OH 0.0 4.0 0.0 132.0 930.0 0.91
3 2019-03-02 WN N413WN 5498 14635 RSW Fort Myers, FL 11066 CMH Columbus, OH 11.0 14.0 0.0 136.0 930.0 0.67
4 2019-03-02 WN N7832A 6933 14635 RSW Fort Myers, FL 11259 DAL Dallas, TX 0.0 -17.0 0.0 151.0 1005.0 0.62
In [16]:
#printing tickets dataset info
tickets.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1167285 entries, 0 to 1167284
Data columns (total 12 columns):
 #   Column             Non-Null Count    Dtype  
---  ------             --------------    -----  
 0   ITIN_ID            1167285 non-null  int64  
 1   YEAR               1167285 non-null  int64  
 2   QUARTER            1167285 non-null  int64  
 3   ORIGIN             1167285 non-null  object 
 4   ORIGIN_COUNTRY     1167285 non-null  object 
 5   ORIGIN_STATE_ABR   1167285 non-null  object 
 6   ORIGIN_STATE_NM    1167285 non-null  object 
 7   ROUNDTRIP          1167285 non-null  float64
 8   REPORTING_CARRIER  1167285 non-null  object 
 9   PASSENGERS         1165308 non-null  float64
 10  ITIN_FARE          1166325 non-null  object 
 11  DESTINATION        1167285 non-null  object 
dtypes: float64(2), int64(3), object(7)
memory usage: 106.9+ MB

Data Quality Insight: At the moment, the ITIN_FARE column is of the object datatype, but we will need this column to a numeric datatype in order to answer Ask #2. We will follow the steps we did above for both the DISTANCE and AIR_TIME columns in the flights dataset.

In [17]:
#finding percent of nulls in the itin_fare column
itin_fare_null_percentage = (tickets['ITIN_FARE'].isnull().sum() / len(tickets['ITIN_FARE'])) * 100

#printing the output
print(f"Percentage of null values in 'ITIN_FARE' column is {itin_fare_null_percentage:.2f}%")
Percentage of null values in 'ITIN_FARE' column is 0.08%
In [18]:
#creating a test distance column before converting values to the numeric datatype
tickets['ITIN_FARE_TEST'] = tickets['ITIN_FARE']

#converting test distance column to numeric datatype and coercing non-convertible values
tickets['ITIN_FARE_TEST'] = pd.to_numeric(tickets['ITIN_FARE_TEST'], errors='coerce')

#finding percent of nulls in the distance test column post-conversion
itin_fare_test_null_percentage = (tickets['ITIN_FARE_TEST'].isnull().sum() / len(tickets['ITIN_FARE_TEST'])) * 100

#printing the output
print(f"Percentage of null values in 'ITIN_FARE_TEST' column is {itin_fare_test_null_percentage:.2f}%")
Percentage of null values in 'ITIN_FARE_TEST' column is 0.35%

Data Quality Insight: The percentage of null values in the ITIN_FARE_TEST column increases by 0.3%. This is a small enough increase that it will not significantlty impact our analysis, so we can proceed with converting the values within our original ITIN_FARE column and coercing non-convertible values.

In [19]:
#dropping distance test column
tickets = tickets.drop(columns=['ITIN_FARE_TEST'])

#converting distance column to numeric datatype and coercing non-convertible values
tickets['ITIN_FARE'] = pd.to_numeric(tickets['ITIN_FARE'], errors='coerce')

#verifying the data type conversion was excuted correctly
tickets.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1167285 entries, 0 to 1167284
Data columns (total 12 columns):
 #   Column             Non-Null Count    Dtype  
---  ------             --------------    -----  
 0   ITIN_ID            1167285 non-null  int64  
 1   YEAR               1167285 non-null  int64  
 2   QUARTER            1167285 non-null  int64  
 3   ORIGIN             1167285 non-null  object 
 4   ORIGIN_COUNTRY     1167285 non-null  object 
 5   ORIGIN_STATE_ABR   1167285 non-null  object 
 6   ORIGIN_STATE_NM    1167285 non-null  object 
 7   ROUNDTRIP          1167285 non-null  float64
 8   REPORTING_CARRIER  1167285 non-null  object 
 9   PASSENGERS         1165308 non-null  float64
 10  ITIN_FARE          1163149 non-null  float64
 11  DESTINATION        1167285 non-null  object 
dtypes: float64(3), int64(3), object(6)
memory usage: 106.9+ MB
In [20]:
tickets.head()
Out[20]:
ITIN_ID YEAR QUARTER ORIGIN ORIGIN_COUNTRY ORIGIN_STATE_ABR ORIGIN_STATE_NM ROUNDTRIP REPORTING_CARRIER PASSENGERS ITIN_FARE DESTINATION
0 201912723049 2019 1 ABI US TX Texas 1.0 MQ 1.0 736.0 DAB
1 201912723085 2019 1 ABI US TX Texas 1.0 MQ 1.0 570.0 COS
2 201912723491 2019 1 ABI US TX Texas 1.0 MQ 1.0 564.0 MCO
3 201912723428 2019 1 ABI US TX Texas 1.0 MQ 1.0 345.0 LGA
4 201912723509 2019 1 ABI US TX Texas 0.0 MQ 1.0 309.0 MGM

Ask #1¶

Find the 10 busiest round trip routes in terms of number of round trip flights in the quarter. Exclude canceled flights when performing the calculation.

Data Quality Insight: There is not a clean way to join flights to tickets in order to determine which flights are roundtrip flights and which are not, but we can use the combination of ORIGIN and DESTINATION to create a ROUNDTRIP_ROUTE column. To do this we will concatenate both columns, using the sorted() function to ensure the same roundtrip route is assigned the two scenarios outlined above (and below).

"An example of a round trip route is the combination of JFK to ORD and ORD to JFK. The opposite order of the route, ORD to JFK and JFK to ORD, would be considered the same round trip."

In [21]:
#creating 'ROUNDTRIP_ROUTE' column in flights dataset
flights['ROUNDTRIP_ROUTE'] = flights.apply(lambda row: '-'.join(sorted([row['ORIGIN'], row['DESTINATION']])), axis=1)

#filtering for non-canceled flights
non_canceled_flights = flights[flights['CANCELLED'] == 0]

#getting counts for top 10 roundtrip routes
unique_routes_counts = non_canceled_flights['ROUNDTRIP_ROUTE'].value_counts().head(10).reset_index()

#renaming the columns in unique_routes_counts
unique_routes_counts.columns = ['ROUNDTRIP_ROUTE', 'COUNT']

#printing output
unique_routes_counts
Out[21]:
ROUNDTRIP_ROUTE COUNT
0 LAX-SFO 8340
1 LGA-ORD 7156
2 LAS-LAX 6511
3 JFK-LAX 6320
4 LAX-SEA 4999
5 BOS-LGA 4820
6 HNL-OGG 4794
7 PDX-SEA 4774
8 ATL-MCO 4707
9 ATL-LGA 4594
In [22]:
#creating bar chart for 10 Busiest Round Trip Routes
fig = px.bar(unique_routes_counts, x='ROUNDTRIP_ROUTE', y='COUNT', text_auto='.2s',
             labels={'ROUNDTRIP_ROUTE': 'Round Trip Route', 'COUNT': 'Trip Count'},
             title='10 Busiest Round Trip Routes')

fig.update_layout(
    xaxis=dict(tickfont=dict(size=12), title_standoff=10),
    yaxis=dict(tickfont=dict(size=12), title_standoff=10),
    title=dict(x=0.5, y=0.85, xanchor='center', yanchor='top'))

fig.show()

Ask #2¶

The 10 most profitable round trip routes (without considering the upfront airplane cost) in the quarter. Along with the profit, show total revenue, total cost, summary values of other key components and total round trip flights in the quarter for the top 10 most profitable routes. Exclude canceled flights from these calculations.

Data Quality Insight: In order to calculate round trip route profits, we must:

  1. Create a new column in the tickets dataframe, as we previously did in the flights dataset, which we will call ROUNDTRIP_ROUTE.
  2. Create a new dataframe where we will aggregate revenue & costs by ROUNDTRIP_ROUTE.
  3. Calculate the revenue associated with each round trip based on:
    • Average round trip route fare
    • Baggage fees (charged twice for both legs of the trip)
    • Average flight occupancy
  4. Calculate the costs associated with each round trip route based on:
    • Airplane Fuel/Oil/Maintenance/Crew costs (\$8 per mile).
    • Airplane Depreciation/Insurance/Other costs (\$1.18 per mile).
    • Airport operational costs (charged twice and based on the aiport size of both orgin & destination airports).
    • Additional operational costs associated with delays (Delays that are 15 minutes or less are free, however each additional minute of delay costs the airline \$75 in added operational costs).
In [23]:
#creating 'ROUNDTRIP_ROUTE' column in tickets dataset
tickets['ROUNDTRIP_ROUTE'] = tickets.apply(lambda row: '-'.join(sorted([row['ORIGIN'], row['DESTINATION']])), axis=1)

#filtering tickets for round trips
round_trip_tickets = tickets[tickets['ROUNDTRIP'] == 1]

#calculating average rountrip route fare
avg_itin_fare = round_trip_tickets.groupby('ROUNDTRIP_ROUTE')['ITIN_FARE'].mean().reset_index()

#calcualting sum of miles and average occupancy rate by ROUNDTRIP_ROUTE usinf non_canceled_flights
round_trip_flight_info = non_canceled_flights.groupby('ROUNDTRIP_ROUTE').agg({
    'ORIGIN': 'first',#adding the first 'ORIGIN' value associated with each route
    'DESTINATION': 'first', #adding the first 'DESTINATION' value associated with each route
    'ARR_DELAY': 'mean', # calculating the average arrival delay for each route
    'DEP_DELAY': 'mean', # calculating the average departure delay for each route
    'DISTANCE': 'first', # getting the first DISTANCE value associated with each route (each route has the ONE specified distance in the flights dataset so no need to grab the mean here)
    'OCCUPANCY_RATE': 'mean', # calculating the average occupancy_rate for each route
}).reset_index()


#merging round_trip_flight_info and avg_itin_fare on'ROUNDTRIP_ROUTE' to create 'round_trips'
round_trips = pd.merge(round_trip_flight_info, avg_itin_fare, on='ROUNDTRIP_ROUTE')

#renaming column names 
round_trips.rename(columns={
    'ARR_DELAY': 'AVERAGE_ARR_DELAY',
    'DEP_DELAY': 'AVERAGE_DEP_DELAY',
    'OCCUPANCY_RATE': 'AVERAGE_OCCUPANCY_RATE',
    'ITIN_FARE': 'AVERAGE_ITIN_FARE'
}, inplace=True)

#getting value_counts for each 'ROUNDTRIP_ROUTE' in 'non_canceled_flights'
route_value_counts = non_canceled_flights['ROUNDTRIP_ROUTE'].value_counts().reset_index()
route_value_counts.columns = ['ROUNDTRIP_ROUTE', 'TOTAL_FLIGHTS']

#merging the value_counts with 'round_trips'
round_trips = pd.merge(round_trips, route_value_counts, on='ROUNDTRIP_ROUTE', how='left')

round_trips
Out[23]:
ROUNDTRIP_ROUTE ORIGIN DESTINATION AVERAGE_ARR_DELAY AVERAGE_DEP_DELAY DISTANCE AVERAGE_OCCUPANCY_RATE AVERAGE_ITIN_FARE TOTAL_FLIGHTS
0 ABE-ATL ATL ABE 2.557604 6.873272 692.0 0.654332 557.141176 434
1 ABE-CLT ABE CLT 1.468127 3.838645 481.0 0.661096 486.394737 502
2 ABE-DTW ABE DTW 5.981855 13.663984 425.0 0.645392 385.809524 497
3 ABE-FLL ABE FLL 4.250000 5.400000 1041.0 0.605250 248.734375 40
4 ABE-ORD ABE ORD 23.666667 29.275000 654.0 0.649094 548.354839 320
... ... ... ... ... ... ... ... ... ...
2930 STL-TPA STL TPA 5.073810 11.299287 869.0 0.658480 375.192982 421
2931 STL-TUL STL TUL 0.265432 6.193846 351.0 0.646308 392.350000 325
2932 SYR-TPA TPA SYR 9.500000 22.486111 1104.0 0.615556 308.512500 72
2933 TPA-TTN TTN TPA 7.557692 14.987261 955.0 0.639427 162.054054 157
2934 VPS-XNA VPS XNA 24.760000 20.360000 601.0 0.605600 270.900000 25

2935 rows × 9 columns

In [24]:
#calculating average_fare_bookings for each round trip route
round_trips['AVERAGE_FARE_BOOKINGS'] = (round_trips['AVERAGE_OCCUPANCY_RATE'] * 200) * round_trips['AVERAGE_ITIN_FARE']


#calculating average_baggage_fees
round_trips['AVERAGE_BAGGAGE_FEES'] = (round_trips['AVERAGE_OCCUPANCY_RATE'] * 200) * 70

#calculating average_airplane_costs for each round trip route
round_trips['AVERAGE_AIRPLANE_COST'] = 9.18 * round_trips['DISTANCE']

#creating function to calculate average arrival/delay costs based on criteria provided in prompt
def calculate_cost(delay_minutes):
    if delay_minutes <= 15:
        return 0
    else:
        return (delay_minutes - 15) * 75

#calcualting average arrival delay costs each round trip route
round_trips['AVERAGE_ARR_DELAY_COST'] = round_trips['AVERAGE_ARR_DELAY'].apply(calculate_cost)

#calcualting average arrival delay costs each round trip route
round_trips['AVERAGE_DEP_DELAY_COST'] = round_trips['AVERAGE_DEP_DELAY'].apply(calculate_cost)

round_trips
Out[24]:
ROUNDTRIP_ROUTE ORIGIN DESTINATION AVERAGE_ARR_DELAY AVERAGE_DEP_DELAY DISTANCE AVERAGE_OCCUPANCY_RATE AVERAGE_ITIN_FARE TOTAL_FLIGHTS AVERAGE_FARE_BOOKINGS AVERAGE_BAGGAGE_FEES AVERAGE_AIRPLANE_COST AVERAGE_ARR_DELAY_COST AVERAGE_DEP_DELAY_COST
0 ABE-ATL ATL ABE 2.557604 6.873272 692.0 0.654332 557.141176 434 72911.037463 9160.645161 6352.56 0.0 0.000000
1 ABE-CLT ABE CLT 1.468127 3.838645 481.0 0.661096 486.394737 502 64310.685783 9255.338645 4415.58 0.0 0.000000
2 ABE-DTW ABE DTW 5.981855 13.663984 425.0 0.645392 385.809524 497 49799.703363 9035.492958 3901.50 0.0 0.000000
3 ABE-FLL ABE FLL 4.250000 5.400000 1041.0 0.605250 248.734375 40 30109.296094 8473.500000 9556.38 0.0 0.000000
4 ABE-ORD ABE ORD 23.666667 29.275000 654.0 0.649094 548.354839 320 71186.739718 9087.312500 6003.72 650.0 1070.625000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2930 STL-TPA STL TPA 5.073810 11.299287 869.0 0.658480 375.192982 421 49411.400758 9218.717340 7977.42 0.0 0.000000
2931 STL-TUL STL TUL 0.265432 6.193846 351.0 0.646308 392.350000 325 50715.764615 9048.307692 3222.18 0.0 0.000000
2932 SYR-TPA TPA SYR 9.500000 22.486111 1104.0 0.615556 308.512500 72 37981.316667 8617.777778 10134.72 0.0 561.458333
2933 TPA-TTN TTN TPA 7.557692 14.987261 955.0 0.639427 162.054054 157 20724.339473 8951.974522 8766.90 0.0 0.000000
2934 VPS-XNA VPS XNA 24.760000 20.360000 601.0 0.605600 270.900000 25 32811.408000 8478.400000 5517.18 732.0 402.000000

2935 rows × 14 columns

In [25]:
#merging 'ORIGIN_SIZE' based on 'ORIGIN' and 'IATA_CODE'
origin_size = filtered_airport_codes[['IATA_CODE', 'TYPE']].rename(columns={'TYPE': 'ORIGIN_SIZE'})
round_trips = pd.merge(round_trips, origin_size, left_on='ORIGIN', right_on='IATA_CODE', how='left')
round_trips.drop('IATA_CODE', axis=1, inplace=True)

#merging 'DESTINATION_SIZE' based on 'DESTINATION' and 'IATA_CODE'
destination_size = filtered_airport_codes[['IATA_CODE', 'TYPE']].rename(columns={'TYPE': 'DESTINATION_SIZE'})
round_trips = pd.merge(round_trips, destination_size, left_on='DESTINATION', right_on='IATA_CODE', how='left')
round_trips.drop('IATA_CODE', axis=1, inplace=True)

round_trips
Out[25]:
ROUNDTRIP_ROUTE ORIGIN DESTINATION AVERAGE_ARR_DELAY AVERAGE_DEP_DELAY DISTANCE AVERAGE_OCCUPANCY_RATE AVERAGE_ITIN_FARE TOTAL_FLIGHTS AVERAGE_FARE_BOOKINGS AVERAGE_BAGGAGE_FEES AVERAGE_AIRPLANE_COST AVERAGE_ARR_DELAY_COST AVERAGE_DEP_DELAY_COST ORIGIN_SIZE DESTINATION_SIZE
0 ABE-ATL ATL ABE 2.557604 6.873272 692.0 0.654332 557.141176 434 72911.037463 9160.645161 6352.56 0.0 0.000000 large_airport medium_airport
1 ABE-CLT ABE CLT 1.468127 3.838645 481.0 0.661096 486.394737 502 64310.685783 9255.338645 4415.58 0.0 0.000000 medium_airport large_airport
2 ABE-DTW ABE DTW 5.981855 13.663984 425.0 0.645392 385.809524 497 49799.703363 9035.492958 3901.50 0.0 0.000000 medium_airport large_airport
3 ABE-FLL ABE FLL 4.250000 5.400000 1041.0 0.605250 248.734375 40 30109.296094 8473.500000 9556.38 0.0 0.000000 medium_airport large_airport
4 ABE-ORD ABE ORD 23.666667 29.275000 654.0 0.649094 548.354839 320 71186.739718 9087.312500 6003.72 650.0 1070.625000 medium_airport large_airport
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2930 STL-TPA STL TPA 5.073810 11.299287 869.0 0.658480 375.192982 421 49411.400758 9218.717340 7977.42 0.0 0.000000 large_airport large_airport
2931 STL-TUL STL TUL 0.265432 6.193846 351.0 0.646308 392.350000 325 50715.764615 9048.307692 3222.18 0.0 0.000000 large_airport large_airport
2932 SYR-TPA TPA SYR 9.500000 22.486111 1104.0 0.615556 308.512500 72 37981.316667 8617.777778 10134.72 0.0 561.458333 large_airport large_airport
2933 TPA-TTN TTN TPA 7.557692 14.987261 955.0 0.639427 162.054054 157 20724.339473 8951.974522 8766.90 0.0 0.000000 medium_airport large_airport
2934 VPS-XNA VPS XNA 24.760000 20.360000 601.0 0.605600 270.900000 25 32811.408000 8478.400000 5517.18 732.0 402.000000 large_airport medium_airport

2935 rows × 16 columns

In [26]:
#getting unique value counts in ORIGIN_SIZE 
round_trips['ORIGIN_SIZE'].value_counts()
Out[26]:
ORIGIN_SIZE
large_airport     2413
medium_airport     482
Name: count, dtype: int64
In [27]:
#finding percent of nulls in the origin_size column
origin_size_percentage = (round_trips['ORIGIN_SIZE'].isnull().sum() / len(round_trips['ORIGIN_SIZE'])) * 100

#printing the output
print(f"Percentage of null values in 'ORIGIN_SIZE' column is {origin_size_percentage:.2f}%")
Percentage of null values in 'ORIGIN_SIZE' column is 1.36%
In [28]:
#getting unique value counts in DESTINATION_SIZE
round_trips['DESTINATION_SIZE'].value_counts()
Out[28]:
DESTINATION_SIZE
large_airport     2402
medium_airport     480
Name: count, dtype: int64
In [29]:
#finding percent of nulls in the origin_size column
destination_size_percentage = (round_trips['DESTINATION_SIZE'].isnull().sum() / len(round_trips['DESTINATION_SIZE'])) * 100

#printing the output
print(f"Percentage of null values in 'DESTINATION_SIZE' column is {destination_size_percentage:.2f}%")
Percentage of null values in 'DESTINATION_SIZE' column is 1.81%

Data Quality Insight: 1% of the rows in the ORIGIN_SIZE columm are NULL and about 2% of the rows in the DESTINATION column are NULL. Since about 82% of the tpyes in both columns are large airports we can replace these NULL values in both with large_airport.

In [30]:
#replacing null values in 'ORIGIN_SIZE' and 'DESTINATION_SIZE' with 'large_airport'
round_trips['ORIGIN_SIZE'].fillna('large_airport', inplace=True)
round_trips['DESTINATION_SIZE'].fillna('large_airport', inplace=True)
In [31]:
#creating function to calculate airport cost based on airport size
def calculate_airport_cost(size):
    if size == 'medium_airport':
        return 5000
    elif size == 'large_airport':
        return 10000
    else:
        return None

#creating 'ORIGIN_AIRPORT_COST' column based on 'ORIGIN_SIZE'
round_trips['ORIGIN_AIRPORT_COST'] = round_trips['ORIGIN_SIZE'].apply(calculate_airport_cost)

#creating 'DESTINATION_AIRPORT_COST' column based on 'DESTINATION_SIZE'
round_trips['DESTINATION_AIRPORT_COST'] = round_trips['DESTINATION_SIZE'].apply(calculate_airport_cost)

round_trips
Out[31]:
ROUNDTRIP_ROUTE ORIGIN DESTINATION AVERAGE_ARR_DELAY AVERAGE_DEP_DELAY DISTANCE AVERAGE_OCCUPANCY_RATE AVERAGE_ITIN_FARE TOTAL_FLIGHTS AVERAGE_FARE_BOOKINGS AVERAGE_BAGGAGE_FEES AVERAGE_AIRPLANE_COST AVERAGE_ARR_DELAY_COST AVERAGE_DEP_DELAY_COST ORIGIN_SIZE DESTINATION_SIZE ORIGIN_AIRPORT_COST DESTINATION_AIRPORT_COST
0 ABE-ATL ATL ABE 2.557604 6.873272 692.0 0.654332 557.141176 434 72911.037463 9160.645161 6352.56 0.0 0.000000 large_airport medium_airport 10000 5000
1 ABE-CLT ABE CLT 1.468127 3.838645 481.0 0.661096 486.394737 502 64310.685783 9255.338645 4415.58 0.0 0.000000 medium_airport large_airport 5000 10000
2 ABE-DTW ABE DTW 5.981855 13.663984 425.0 0.645392 385.809524 497 49799.703363 9035.492958 3901.50 0.0 0.000000 medium_airport large_airport 5000 10000
3 ABE-FLL ABE FLL 4.250000 5.400000 1041.0 0.605250 248.734375 40 30109.296094 8473.500000 9556.38 0.0 0.000000 medium_airport large_airport 5000 10000
4 ABE-ORD ABE ORD 23.666667 29.275000 654.0 0.649094 548.354839 320 71186.739718 9087.312500 6003.72 650.0 1070.625000 medium_airport large_airport 5000 10000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2930 STL-TPA STL TPA 5.073810 11.299287 869.0 0.658480 375.192982 421 49411.400758 9218.717340 7977.42 0.0 0.000000 large_airport large_airport 10000 10000
2931 STL-TUL STL TUL 0.265432 6.193846 351.0 0.646308 392.350000 325 50715.764615 9048.307692 3222.18 0.0 0.000000 large_airport large_airport 10000 10000
2932 SYR-TPA TPA SYR 9.500000 22.486111 1104.0 0.615556 308.512500 72 37981.316667 8617.777778 10134.72 0.0 561.458333 large_airport large_airport 10000 10000
2933 TPA-TTN TTN TPA 7.557692 14.987261 955.0 0.639427 162.054054 157 20724.339473 8951.974522 8766.90 0.0 0.000000 medium_airport large_airport 5000 10000
2934 VPS-XNA VPS XNA 24.760000 20.360000 601.0 0.605600 270.900000 25 32811.408000 8478.400000 5517.18 732.0 402.000000 large_airport medium_airport 10000 5000

2935 rows × 18 columns

In [32]:
#Calculating average/total revenue for each roundtrip route
round_trips['AVERAGE_TOTAL_REVENUE'] = (round_trips['AVERAGE_FARE_BOOKINGS'] + round_trips['AVERAGE_BAGGAGE_FEES'])
round_trips['TOTAL_REVENUE'] = (round_trips['AVERAGE_FARE_BOOKINGS'] + round_trips['AVERAGE_BAGGAGE_FEES'])* round_trips['TOTAL_FLIGHTS']

#calculating average/total expenses for each roundtrip route
round_trips['AVERAGE_TOTAL_EXPENSE'] = (round_trips['AVERAGE_AIRPLANE_COST'] + 
    round_trips['AVERAGE_ARR_DELAY_COST'] + 
    round_trips['AVERAGE_DEP_DELAY_COST'] + 
    round_trips['ORIGIN_AIRPORT_COST'] + 
    round_trips['DESTINATION_AIRPORT_COST'])
round_trips['TOTAL_EXPENSE'] = (
    (round_trips['AVERAGE_AIRPLANE_COST'] + 
    round_trips['AVERAGE_ARR_DELAY_COST'] + 
    round_trips['AVERAGE_DEP_DELAY_COST'] + 
    round_trips['ORIGIN_AIRPORT_COST'] + 
    round_trips['DESTINATION_AIRPORT_COST'])* round_trips['TOTAL_FLIGHTS'])

#calculating average/total profit for eaxch rountrip route
round_trips['AVERAGE_PROFIT'] = round_trips['AVERAGE_TOTAL_REVENUE'] - round_trips['AVERAGE_TOTAL_EXPENSE']
round_trips['TOTAL_PROFIT'] = round_trips['TOTAL_REVENUE'] - round_trips['TOTAL_EXPENSE']

round_trips
Out[32]:
ROUNDTRIP_ROUTE ORIGIN DESTINATION AVERAGE_ARR_DELAY AVERAGE_DEP_DELAY DISTANCE AVERAGE_OCCUPANCY_RATE AVERAGE_ITIN_FARE TOTAL_FLIGHTS AVERAGE_FARE_BOOKINGS ... ORIGIN_SIZE DESTINATION_SIZE ORIGIN_AIRPORT_COST DESTINATION_AIRPORT_COST AVERAGE_TOTAL_REVENUE TOTAL_REVENUE AVERAGE_TOTAL_EXPENSE TOTAL_EXPENSE AVERAGE_PROFIT TOTAL_PROFIT
0 ABE-ATL ATL ABE 2.557604 6.873272 692.0 0.654332 557.141176 434 72911.037463 ... large_airport medium_airport 10000 5000 82071.682624 3.561911e+07 21352.560000 9267011.04 60719.122624 2.635210e+07
1 ABE-CLT ABE CLT 1.468127 3.838645 481.0 0.661096 486.394737 502 64310.685783 ... medium_airport large_airport 5000 10000 73566.024429 3.693014e+07 19415.580000 9746621.16 54150.444429 2.718352e+07
2 ABE-DTW ABE DTW 5.981855 13.663984 425.0 0.645392 385.809524 497 49799.703363 ... medium_airport large_airport 5000 10000 58835.196321 2.924109e+07 18901.500000 9394045.50 39933.696321 1.984705e+07
3 ABE-FLL ABE FLL 4.250000 5.400000 1041.0 0.605250 248.734375 40 30109.296094 ... medium_airport large_airport 5000 10000 38582.796094 1.543312e+06 24556.380000 982255.20 14026.416094 5.610566e+05
4 ABE-ORD ABE ORD 23.666667 29.275000 654.0 0.649094 548.354839 320 71186.739718 ... medium_airport large_airport 5000 10000 80274.052218 2.568770e+07 22724.345000 7271790.40 57549.707218 1.841591e+07
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2930 STL-TPA STL TPA 5.073810 11.299287 869.0 0.658480 375.192982 421 49411.400758 ... large_airport large_airport 10000 10000 58630.118098 2.468328e+07 27977.420000 11778493.82 30652.698098 1.290479e+07
2931 STL-TUL STL TUL 0.265432 6.193846 351.0 0.646308 392.350000 325 50715.764615 ... large_airport large_airport 10000 10000 59764.072308 1.942332e+07 23222.180000 7547208.50 36541.892308 1.187612e+07
2932 SYR-TPA TPA SYR 9.500000 22.486111 1104.0 0.615556 308.512500 72 37981.316667 ... large_airport large_airport 10000 10000 46599.094444 3.355135e+06 30696.178333 2210124.84 15902.916111 1.145010e+06
2933 TPA-TTN TTN TPA 7.557692 14.987261 955.0 0.639427 162.054054 157 20724.339473 ... medium_airport large_airport 5000 10000 29676.313996 4.659181e+06 23766.900000 3731403.30 5909.413996 9.277780e+05
2934 VPS-XNA VPS XNA 24.760000 20.360000 601.0 0.605600 270.900000 25 32811.408000 ... large_airport medium_airport 10000 5000 41289.808000 1.032245e+06 21651.180000 541279.50 19638.628000 4.909657e+05

2935 rows × 24 columns

In [33]:
#finding the 10 most profitable routes based on 'AVERAGE_TOTAL_PROFIT'
top_10_routes = round_trips.nlargest(10, 'TOTAL_PROFIT')

top_10_routes[['ROUNDTRIP_ROUTE', 'TOTAL_PROFIT']]
Out[33]:
ROUNDTRIP_ROUTE TOTAL_PROFIT
2076 JFK-LAX 5.740261e+08
2103 JFK-SFO 2.832267e+08
1619 EWR-SFO 2.553984e+08
2273 LGA-ORD 2.048425e+08
2230 LAX-SFO 2.006740e+08
189 ATL-LGA 1.923855e+08
1153 DCA-ORD 1.877187e+08
186 ATL-LAX 1.788756e+08
564 BOS-LGA 1.643118e+08
1137 DCA-LGA 1.632471e+08
In [34]:
#creating bar chart for 10 Most Profitable Round Trip Routes by Total Profit
fig = px.bar(top_10_routes, x='ROUNDTRIP_ROUTE', y='TOTAL_PROFIT', text_auto='.2s',
             labels={'ROUNDTRIP_ROUTE': 'Round Trip Route', 'TOTAL_PROFIT': 'Profit'},
             title='10 Most Profitable Round Trip Routes by Total Profit')

fig.update_layout(
    xaxis=dict(tickfont=dict(size=12), title_standoff=10),
    yaxis=dict(tickfont=dict(size=12), title_standoff=10),
    title=dict(x=0.5, y=0.85, xanchor='center', yanchor='top'))

fig.show()
In [35]:
#creating bar chart for 10 Most Profitable Round Trip Routes by Average Profit
fig = px.bar(top_10_routes, x='ROUNDTRIP_ROUTE', y='AVERAGE_PROFIT', text_auto='.2s',
             labels={'ROUNDTRIP_ROUTE': 'Round Trip Route', 'AVERAGE_PROFIT': 'Profit'},
             title='10 Most Profitable Round Trip Routes by Average Profit')

fig.update_layout(
    xaxis=dict(tickfont=dict(size=12), title_standoff=10),
    yaxis=dict(tickfont=dict(size=12), title_standoff=10),
    title=dict(x=0.5, y=0.85, xanchor='center', yanchor='top'))

fig.show()
In [36]:
#creating bar chart for 10 Most Profitable Round Trip Routes by Total Revenue
fig = px.bar(top_10_routes, x='ROUNDTRIP_ROUTE', y='TOTAL_REVENUE', text_auto='.2s',
             labels={'ROUNDTRIP_ROUTE': 'Round Trip Route', 'TOTAL_REVENUE': 'Revenue'},
             title='10 Most Profitable Round Trip Routes by Total Revenue')

fig.update_layout(
    xaxis=dict(tickfont=dict(size=12), title_standoff=10),
    yaxis=dict(tickfont=dict(size=12), title_standoff=10),
    title=dict(x=0.5, y=0.85, xanchor='center', yanchor='top'))

fig.show()
In [37]:
#creating bar chart for 10 Most Profitable Round Trip Routes by Total Expense
fig = px.bar(top_10_routes, x='ROUNDTRIP_ROUTE', y='TOTAL_EXPENSE', text_auto='.2s',
             labels={'ROUNDTRIP_ROUTE': 'Round Trip Route', 'TOTAL_EXPENSE': 'Expense'},
             title='10 Most Profitable Round Trip Routes by Total Expense')

fig.update_layout(
    xaxis=dict(tickfont=dict(size=12), title_standoff=10),
    yaxis=dict(tickfont=dict(size=12), title_standoff=10),
    title=dict(x=0.5, y=0.85, xanchor='center', yanchor='top'))

fig.show()
In [38]:
#creating bar chart for 10 Most Profitable Round Trip Routes by Total Flights
fig = px.bar(top_10_routes, x='ROUNDTRIP_ROUTE', y='TOTAL_FLIGHTS', text_auto='.2s',
             labels={'ROUNDTRIP_ROUTE': 'Round Trip Route', 'TOTAL_FLIGHTS': 'Flights'},
             title='10 Most Profitable Round Trip Routes by Total Flights')

fig.update_layout(
    xaxis=dict(tickfont=dict(size=12), title_standoff=10),
    yaxis=dict(tickfont=dict(size=12), title_standoff=10),
    title=dict(x=0.5, y=0.85, xanchor='center', yanchor='top'))

fig.show()
In [39]:
#creating bar chart for 10 Most Profitable Round Trip Routes by Average Total Fare Bookings
fig = px.bar(top_10_routes, x='ROUNDTRIP_ROUTE', y='AVERAGE_FARE_BOOKINGS', text_auto='.2s',
             labels={'ROUNDTRIP_ROUTE': 'Round Trip Route', 'AVERAGE_FARE_BOOKINGS': 'Flights'},
             title='10 Most Profitable Round Trip Routes by Average Total Fare Bookings')

fig.update_layout(
    xaxis=dict(tickfont=dict(size=12), title_standoff=10),
    yaxis=dict(tickfont=dict(size=12), title_standoff=10),
    title=dict(x=0.5, y=0.85, xanchor='center', yanchor='top'))

fig.show()
In [40]:
#creating bar chart for 10 Most Profitable Round Trip Routes by Average Airplane Costs
fig = px.bar(top_10_routes, x='ROUNDTRIP_ROUTE', y='AVERAGE_AIRPLANE_COST', text_auto='.2s',
             labels={'ROUNDTRIP_ROUTE': 'Round Trip Route', 'AVERAGE_AIRPLANE_COST': 'Airplane Costs'},
             title='10 Most Profitable Round Trip Routes by Average Airplane Costs')

fig.update_layout(
    xaxis=dict(tickfont=dict(size=12), title_standoff=10),
    yaxis=dict(tickfont=dict(size=12), title_standoff=10),
    title=dict(x=0.5, y=0.85, xanchor='center', yanchor='top'))

fig.show()

Ask #3¶

Top 5 Recommended Round Trip Routes To Invest In

  1. JFK-LAX:
    • Total Profit: \$574,026,100
    • Total Revenue: \$844,019,700
    • Total Flights: 6,320
    • Reason: This route shows the highest total profit, indicating good profit potential. The high total revenue and a considerable number of flights further support its attractiveness for investment.
  1. JFK-SFO:
    • Total Profit: \$283,226,700
    • Total Revenue: \$446,080,100
    • Total Flights: 3,721
    • Reason: This route also has a substantial total profit and significant total revenue, making it a promising investment opportunity.
  1. EWR-SFO:
    • Total Profit: \$255,398,400
    • Total Revenue: \$362,136,800
    • Total Flights: 2,424
    • Reason: This route shows good profitability, and the total revenue is quite substantial despite having fewer flights compared to other routes.
  1. ATL-LAX:
    • Total Profit: \$178,875,600
    • Total Revenue: \$299,994,900
    • Total Flights: 3,198
    • Reason: Despite having substantially less total profit than the top 3 routes in this list, ATL-LAX has the 4th largest average profit per flight. Atlanta and Los Angeles are also major cities with strong economic activity and tourism, making the route attractive for both business and leisure travelers. Investing in increasing number of flights for this route might prove fruitful in the longterm.
  1. DCA-ORD:
    • Total Profit: \$163,247,100
    • Total Revenue: \$237,025,900
    • Total Flights: 3,359
    • Reason: Despite having substantially less total profit than the top 3 routes in this list, DCA-ORD has the 5th largest average profit per flight. Washington D.C. and Chicago are also critical business and political centers, which attract a steady flow of business travelers and government officials. Investing in increasing number of flights for this route might prove fruitful in the longterm.

Ask #4¶

In [41]:
#creating a dataframe with only the specified roundtrip routes
selected_routes = ['JFK-LAX', 'JFK-SFO', 'EWR-SFO', 'ATL-LAX', 'DCA-ORD']
selected_top_routes = top_10_routes[top_10_routes['ROUNDTRIP_ROUTE'].isin(selected_routes)].copy()

#calculating the number of flights needed to break even on the upfront airplane cost
selected_top_routes['FLIGHTS_TO_BREAK_EVEN'] = 90000000 / selected_top_routes['AVERAGE_PROFIT']

#rounding breakeven values to whole numbers
selected_top_routes['FLIGHTS_TO_BREAK_EVEN'] = selected_top_routes['FLIGHTS_TO_BREAK_EVEN'].round(0)

selected_top_routes[['ROUNDTRIP_ROUTE', 'FLIGHTS_TO_BREAK_EVEN']]
Out[41]:
ROUNDTRIP_ROUTE FLIGHTS_TO_BREAK_EVEN
2076 JFK-LAX 991.0
2103 JFK-SFO 1182.0
1619 EWR-SFO 854.0
1153 DCA-ORD 1772.0
186 ATL-LAX 1609.0

Number of Flights to Breakeven for Top 5 Recommended Flights:

  1. JFK-LAX: 991 flights required to breakeven on upfront cost of airplane.
  2. JFK-SFO: 1182 flights required to breakeven on upfront cost of airplane.
  3. EWR-SFO: 854 flights required to breakeven on upfront cost of airplane.
  4. DCA-ORD: 1772 flights required to breakeven on upfront cost of airplane.
  5. ATL-LAX: 1609 flights required to breakevem on upfront cost of airplane.
In [42]:
#creating bar chart for Flights to Breakeven for Top 5 Recommended Flights
fig = px.bar(selected_top_routes, x='ROUNDTRIP_ROUTE', y='FLIGHTS_TO_BREAK_EVEN', text_auto='.2s',
             labels={'ROUNDTRIP_ROUTE': 'Round Trip Route', 'FLIGHTS_TO_BREAK_EVEN': 'Flights'},
             title='Flights to Breakeven for Top 5 Recommended Flights')

fig.update_layout(
    xaxis=dict(tickfont=dict(size=12), title_standoff=10),
    yaxis=dict(tickfont=dict(size=12), title_standoff=10),
    title=dict(x=0.5, y=0.85, xanchor='center', yanchor='top'))

fig.show()

Ask #5¶

Key Performance Indicators (KPI’s) To Track (From Dataset):

  1. Average Total Revenue: This shows the average revenue generated per flight for each route. Higher revenues indicate a more profitable route.
  2. Average Total Expense: Lower expenses relative to revenue indicate a more profitable route.
  3. Average Profit: This directly shows the profitability of each route.
  4. Average Occupancy Rate: Higher occupancy rates indicate demand for a route. These five recommended routes have reasonably high occupancy rates (64-65%).
  5. Total Flights - More flights indicates higher demand for a route.
  6. Flights to Break Even - Lower numbers indicate a more profitable route.
  7. Average Itinerary Fare: Higher airfare bookings indicates travelers are willing to pay more for a route, suggesting higher demand.
  8. Average Baggage Fees: More fees collected suggests higher utilization and demand.
  9. Average Departure Delays: Lower delays improve customer experience and indicate an efficiently run route.
  10. Average Arrival Delays: Lower delays improve customer experience and indicate an efficiently run route.

Additional Key Performance Indicators (aKPI’s) To Track (Not Inlcuded In Dataset):

  1. Total Revenue Growth: Tracking revenue growth helps identify growing vs declining routes.
  2. Profit Margin: Higher margins indicate a more profitable route. Can compare to benchmarks.
  3. Airplane Utilization Rates: Higher utilization indicates demand and efficient operations.
  4. Customer Satisfaction Scores: Happier customers make routes more sustainable long-term.
  5. Employee Satisfaction Scores: Good employee morale improves service and reliability.

Metadata For New Fields¶

In [43]:
#creating metadata as a dictionary
metadata_dict = {
  'ROUNDTRIP_ROUTE': 'The roundtrip flight route between the origin and destination airports.',
  'AVERAGE_ARR_DELAY': 'The average arrival delay time in minutes for a rountrip route.',
  'AVERAGE_DEP_DELAY': 'The average departure delay time in minutes for a rountrip route.',
  'AVERAGE_OCCUPANCY_RATE': 'The average percentage of seats filled for a rountrip route.',
  'AVERAGE_ITIN_FARE': 'The average rountrip itinerary fare for a rountrip route.',
  'TOTAL_FLIGHTS': 'The total number of flights for a rountrip route.',
  'AVERAGE_FARE_BOOKINGS': 'The average total itinerary fare revenue. Caculated from (Average Occupancy Rate * 200) multiplied by Average Itinerary Fare.',
  'AVERAGE_BAGGAGE_FEES': 'The average baggage fee revenue collected per rountrip route. This includes origin and destination in calculation.',
  'AVERAGE_AIRPLANE_COST': 'The average operational cost per flight for the airplane servicing a rountrip route. This incudes the cost of fuel, oil, maintenance, crew, depreciation, insurance, etc. (per mile).',
  'AVERAGE_ARR_DELAY_COST': 'The average arrival delay cost incurred for a rountrip route.',
  'AVERAGE_DEP_DELAY_COST': 'The average departure delay cost incurred for a rountrip route.',
  'ORIGIN_SIZE': 'The size of the origin airport - either medium or large.',
  'DESTINATION_SIZE': 'The size of the destination airport - medium or large.',
  'ORIGIN_AIRPORT_COST': 'The fixed operating cost for the origin airport.',
  'DESTINATION_AIRPORT_COST': 'The fixed operating cost for the destination airport.',
  'AVERAGE_TOTAL_REVENUE': 'The average total revenue generated per roundtrip route.',
  'TOTAL_REVENUE': 'The total revenue generated by the total flights recorded for a rountrip route.',
  'AVERAGE_TOTAL_EXPENSE': 'The average total cost per flight on this route.',
  'TOTAL_EXPENSE': 'The total cost of the total flights recorded for a rountrip route.',
  'AVERAGE_PROFIT': 'The average profit per flight on this route.',
  'TOTAL_PROFIT': 'The total profit from the total flights recorded for a rountrip route.',
  'FLIGHTS_TO_BREAK_EVEN': 'The estimated number of flights needed for this route to break-even with respects to the upfront cost of the airplane ($90 million).'
}


#creating metadata dataframe from metadata dictionary
metadata = pd.DataFrame(list(metadata_dict.items()), columns=['New Column', 'Description'])

# Setting the maximum column width to 'None' to make it easier to view metadata Descriptions
pd.options.display.max_colwidth = None

metadata
Out[43]:
New Column Description
0 ROUNDTRIP_ROUTE The roundtrip flight route between the origin and destination airports.
1 AVERAGE_ARR_DELAY The average arrival delay time in minutes for a rountrip route.
2 AVERAGE_DEP_DELAY The average departure delay time in minutes for a rountrip route.
3 AVERAGE_OCCUPANCY_RATE The average percentage of seats filled for a rountrip route.
4 AVERAGE_ITIN_FARE The average rountrip itinerary fare for a rountrip route.
5 TOTAL_FLIGHTS The total number of flights for a rountrip route.
6 AVERAGE_FARE_BOOKINGS The average total itinerary fare revenue. Caculated from (Average Occupancy Rate * 200) multiplied by Average Itinerary Fare.
7 AVERAGE_BAGGAGE_FEES The average baggage fee revenue collected per rountrip route. This includes origin and destination in calculation.
8 AVERAGE_AIRPLANE_COST The average operational cost per flight for the airplane servicing a rountrip route. This incudes the cost of fuel, oil, maintenance, crew, depreciation, insurance, etc. (per mile).
9 AVERAGE_ARR_DELAY_COST The average arrival delay cost incurred for a rountrip route.
10 AVERAGE_DEP_DELAY_COST The average departure delay cost incurred for a rountrip route.
11 ORIGIN_SIZE The size of the origin airport - either medium or large.
12 DESTINATION_SIZE The size of the destination airport - medium or large.
13 ORIGIN_AIRPORT_COST The fixed operating cost for the origin airport.
14 DESTINATION_AIRPORT_COST The fixed operating cost for the destination airport.
15 AVERAGE_TOTAL_REVENUE The average total revenue generated per roundtrip route.
16 TOTAL_REVENUE The total revenue generated by the total flights recorded for a rountrip route.
17 AVERAGE_TOTAL_EXPENSE The average total cost per flight on this route.
18 TOTAL_EXPENSE The total cost of the total flights recorded for a rountrip route.
19 AVERAGE_PROFIT The average profit per flight on this route.
20 TOTAL_PROFIT The total profit from the total flights recorded for a rountrip route.
21 FLIGHTS_TO_BREAK_EVEN The estimated number of flights needed for this route to break-even with respects to the upfront cost of the airplane ($90 million).